Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Li, Yunxin, Liu, Zhenyu, Li, Zitao, Zhang, Xuanyu, Xu, Zhenran, Chen, Xinyu, Shi, Haoyuan, Jiang, Shenyuan, Wang, Xintong, Wang, Jifang, Huang, Shouzheng, Zhao, Xinping, Jiang, Borui, Hong, Lanqing, Wang, Longyue, Tian, Zhuotao, Huai, Baoxing, Luo, Wenhan, Luo, Weihua, Zhang, Zheng, Hu, Baotian, Zhang, Min
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
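To make the Multimodal Chain-of-Thought (MCoT) idea concrete, here is a minimal two-stage prompting sketch: first elicit a grounded rationale, then condition the answer on it. `vlm_generate` is a hypothetical placeholder for any vision-language model call, not an API from the survey.

```python
# Minimal sketch of two-stage Multimodal Chain-of-Thought (MCoT) prompting.
# `vlm_generate` is a hypothetical stand-in; swap in a real multimodal client.

def vlm_generate(image: bytes, prompt: str) -> str:
    """Hypothetical vision-language model call; plug in a concrete model."""
    raise NotImplementedError("replace with a real multimodal API call")

def mcot_answer(image: bytes, question: str) -> str:
    # Stage 1: elicit an explicit reasoning chain grounded in the image.
    rationale = vlm_generate(
        image,
        f"Question: {question}\n"
        "Describe the relevant visual evidence and reason step by step.",
    )
    # Stage 2: condition the final answer on the generated rationale.
    return vlm_generate(
        image,
        f"Question: {question}\nReasoning: {rationale}\n"
        "Based on this reasoning, give the final answer.",
    )
```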
MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
Xu, Chejian, Zhang, Jiawei, Chen, Zhaorun, Xie, Chulin, Kang, Mintong, Potter, Yujin, Wang, Zhun, Yuan, Zhuowen, Xiong, Alexander, Xiong, Zidi, Zhang, Chenhui, Yuan, Lingzhi, Zeng, Yi, Xu, Peiyang, Guo, Chengquan, Zhou, Andy, Tan, Jeffrey Ziwei, Zhao, Xuandong, Pinto, Francesco, Xiang, Zhen, Gai, Yu, Lin, Zinan, Hendrycks, Dan, Li, Bo, Song, Dawn
Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as text-to-image models generating unsafe content. Existing benchmarks for multimodal models either predominantly assess their helpfulness or focus only on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. For each perspective, we have designed various evaluation scenarios and red-teaming algorithms across different tasks to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.
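As a rough illustration of the multi-perspective reporting the platform describes, the following sketch aggregates per-perspective pass rates. The perspective names come from the abstract; the data structures and scoring logic are assumptions for illustration, not MMDT's code.

```python
# Illustrative only: aggregate per-perspective pass rates over test cases,
# mirroring the kind of multi-perspective report MMDT produces.

from statistics import mean

PERSPECTIVES = ["safety", "hallucination", "fairness",
                "privacy", "adversarial_robustness", "ood"]

def aggregate(results: dict[str, list[bool]]) -> dict[str, float]:
    """Map each perspective to the fraction of its test cases passed."""
    return {p: mean(results[p]) for p in PERSPECTIVES if results.get(p)}

# Toy usage with made-up outcomes.
print(aggregate({"safety": [True, False, True], "privacy": [True]}))
```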
Unlocking Generalization for Robotics via Modularity and Scale
How can we build generalist robot systems? Scale may not be enough, given the significant multimodality of robotics tasks, the lack of easily accessible data, and the challenges of deploying on physical hardware. Meanwhile, most deployed robotic systems today are inherently modular and can leverage the independent generalization capabilities of each module to perform well. This thesis therefore seeks to tackle the task of building generalist robot agents by integrating these components into one: combining modularity with large-scale learning for general-purpose robot control. The first question we consider is: how can we build modularity and hierarchy into learning systems? Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can enforce modularity via planning to enable more efficient and capable robot learners. Next, we come to the role of scale in building generalist robot systems. To scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data, and a source of supervision to generate the data. We leverage a powerful supervision source: classical planning, which can generalize but is expensive to run and requires access to privileged information to perform well in practice. We use these planners to supervise large-scale policy learning in simulation to produce generalist agents. Finally, we consider how to unify modularity with large-scale policy learning to build real-world robot systems capable of performing zero-shot manipulation. We do so by tightly integrating key ingredients: modular high- and mid-level planning, learned local control, procedural scene generation, and large-scale policy learning for sim2real transfer. We demonstrate that this recipe can produce a single, generalist agent that can solve challenging long-horizon manipulation tasks in the real world.
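The modularity-plus-scale recipe reduces to a plan-then-execute loop: a high- or mid-level planner proposes subgoals and a learned local policy executes them. The sketch below assumes hypothetical `plan`, `policy`, and `env` interfaces rather than the thesis's actual components.

```python
# Schematic of the modular recipe: planner proposes subgoals, learned
# controller executes them. All interfaces here are hypothetical stand-ins.

def plan(observation, goal) -> list:
    """Planner module (e.g., a classical planner) returning subgoals."""
    raise NotImplementedError("plug in a real planner")

def policy(observation, subgoal):
    """Learned local controller, e.g., trained at scale in simulation."""
    raise NotImplementedError("plug in a learned policy")

def run_episode(env, goal):
    obs = env.reset()
    for subgoal in plan(obs, goal):        # modularity: explicit hierarchy
        done = False
        while not done:
            # Assumed env API: step returns (observation, done flag).
            obs, done = env.step(policy(obs, subgoal))
    return obs
```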
Comprehensive Exploration of Synthetic Data Generation: A Survey
Bauer, André, Trapp, Simon, Stenger, Michael, Leppich, Robert, Kounev, Samuel, Leznik, Mark, Chard, Kyle, Foster, Ian
Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data, owing to expensive acquisition and privacy legislation. Synthetic data emerges as a solution, but the abundance of released models and the limited overview literature pose challenges for decision-making. This work surveys 417 Synthetic Data Generation (SDG) models from the last decade, providing a comprehensive overview of model types, functionality, and improvements. Common attributes are identified, leading to a classification and trend analysis. The findings reveal increased model performance and complexity, with neural network-based approaches prevailing, except in privacy-preserving data generation. Computer vision dominates, with GANs as the primary generative models, while diffusion models, transformers, and RNNs compete. Our performance evaluation highlights the scarcity of common metrics and datasets, which makes comparisons challenging. Additionally, the neglect of training and computational costs in the literature calls for attention in future research. This work serves as a guide for SDG model selection and identifies crucial areas for future exploration.
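Since the survey identifies GANs as the primary generative models in SDG, a minimal non-saturating GAN training step in PyTorch may help fix ideas. This is a generic textbook sketch on toy tabular-shaped data, not any surveyed model.

```python
# Minimal non-saturating GAN training step (generic sketch, PyTorch).

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real: torch.Tensor):
    n = real.size(0)
    fake = G(torch.randn(n, latent_dim))
    # Discriminator: real -> 1, fake -> 0 (detach so G is untouched here).
    d_loss = bce(D(real), torch.ones(n, 1)) + \
             bce(D(fake.detach()), torch.zeros(n, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator: non-saturating loss, push D toward 1 on fakes.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

train_step(torch.randn(32, data_dim))  # toy "real" batch for illustration
```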
Holistic Evaluation of GPT-4V for Biomedical Imaging
Liu, Zhengliang, Jiang, Hanqi, Zhong, Tianyang, Wu, Zihao, Ma, Chong, Li, Yiwei, Yu, Xiaowei, Zhang, Yutong, Pan, Yi, Shu, Peng, Lyu, Yanjun, Zhang, Lu, Yao, Junjie, Dong, Peixin, Cao, Chao, Xiao, Zhenxiang, Wang, Jiaqi, Zhao, Huan, Xu, Shaochen, Wei, Yaonai, Chen, Jingyuan, Dai, Haixing, Wang, Peilong, He, Hao, Wang, Zewei, Wang, Xinyu, Zhang, Xu, Zhao, Lin, Liu, Yiheng, Zhang, Kai, Yan, Liheng, Sun, Lichao, Liu, Jun, Qiang, Ning, Ge, Bao, Cai, Xiaoyan, Zhao, Shijie, Hu, Xintao, Yuan, Yixuan, Li, Gang, Zhang, Shu, Zhang, Xin, Jiang, Xi, Zhang, Tuo, Shen, Dinggang, Li, Quanzheng, Liu, Wei, Li, Xiang, Zhu, Dajiang, Liu, Tianming
In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.
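The per-task probing described here can be reproduced in miniature with the OpenAI Python client's image-input format. The model name, task prompts, and scoring below are illustrative assumptions, not the paper's protocol; do not run this on real patient data without appropriate approvals.

```python
# Sketch of per-task probing of a multimodal model on a biomedical image,
# using the OpenAI chat-completions image format. Model name and prompts
# are illustrative stand-ins for the paper's GPT-4V setup.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TASKS = {
    "modality_recognition": "What imaging modality is shown?",
    "anatomy_localization": "Which anatomical region is depicted?",
    "report_generation": "Write a brief diagnostic report for this image.",
}

def probe(image_path: str, task: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in for GPT-4V
        messages=[{"role": "user", "content": [
            {"type": "text", "text": TASKS[task]},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content
```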
Multimodal Foundation Models: From Specialists to General-Purpose Assistants
Li, Chunyuan, Gan, Zhe, Yang, Zhengyuan, Yang, Jianwei, Li, Linjie, Wang, Lijuan, Gao, Jianfeng
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants. The research landscape encompasses five core topics, categorized into two classes. (i) We start with a survey of well-established research areas: multimodal foundation models pre-trained for specific purposes, including two topics -- methods of learning vision backbones for visual understanding and text-to-image generation. (ii) Then, we present recent advances in exploratory, open research areas: multimodal foundation models that aim to play the role of general-purpose assistants, including three topics -- unified vision models inspired by large language models (LLMs), end-to-end training of multimodal LLMs, and chaining multimodal tools with LLMs. The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities who are eager to learn the basics and recent advances in multimodal foundation models.
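The third open topic, chaining multimodal tools with LLMs, amounts to a controller loop in which the LLM selects tools and reads their outputs until it can answer. Everything in this sketch (`llm`, the tool registry, the stopping convention) is a hypothetical stand-in, not an interface from the survey.

```python
# Sketch of LLM-driven multimodal tool chaining. All components are
# hypothetical placeholders; real systems add parsing and error handling.

def llm(prompt: str) -> str:
    """Hypothetical LLM controller call."""
    raise NotImplementedError("plug in a real LLM")

TOOLS = {
    "caption": lambda image: "a photo of ...",   # stand-in captioner
    "detect":  lambda image: ["person", "dog"],  # stand-in detector
}

def chain(image, question: str, max_steps: int = 3) -> str:
    context = f"Question: {question}"
    for _ in range(max_steps):
        choice = llm(f"{context}\nPick a tool from {list(TOOLS)} "
                     "or reply 'ANSWER: <answer>'.")
        if choice.startswith("ANSWER:"):
            return choice.removeprefix("ANSWER:").strip()
        # Feed the tool's output back into the controller's context.
        context += f"\n{choice} -> {TOOLS[choice](image)}"
    return llm(f"{context}\nGive the final answer:")
```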
Interactive Natural Language Processing
Wang, Zekun, Zhang, Ge, Yang, Kexin, Shi, Ning, Zhou, Wangchunshu, Hao, Shaochun, Xiong, Guangzheng, Li, Yizhi, Sim, Mong Yuan, Chen, Xiuying, Zhu, Qingqing, Yang, Zhenzhu, Nik, Adam, Liu, Qi, Lin, Chenghua, Wang, Shi, Liu, Ruibo, Chen, Wenhu, Xu, Ke, Liu, Dayiheng, Guo, Yike, Fu, Jie
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP, aimed at addressing limitations in existing frameworks while aligning with the ultimate goals of artificial intelligence. This paradigm considers language models as agents capable of observing, acting, and receiving feedback iteratively from external entities. Specifically, language models in this context can: (1) interact with humans for better understanding and addressing user needs, personalizing responses, aligning with human values, and improving the overall user experience; (2) interact with knowledge bases for enriching language representations with factual knowledge, enhancing the contextual relevance of responses, and dynamically leveraging external information to generate more accurate and informed responses; (3) interact with models and tools for effectively decomposing and addressing complex tasks, leveraging specialized expertise for specific subtasks, and fostering the simulation of social behaviors; and (4) interact with environments for learning grounded representations of language, and effectively tackling embodied tasks such as reasoning, planning, and decision-making in response to environmental observations. This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept. We then provide a systematic classification of iNLP, dissecting its various components, including interactive objects, interaction interfaces, and interaction methods. We proceed to delve into the evaluation methodologies used in the field, explore its diverse applications, scrutinize its ethical and safety issues, and discuss prospective research directions. This survey serves as an entry point for researchers who are interested in this rapidly evolving area and offers a broad view of the current landscape and future trajectory of iNLP.
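The observe-act-feedback cycle that defines iNLP can be written down as a short interaction loop, whatever the external entity is (human, knowledge base, tool, or environment). `lm_act` and `external` are hypothetical stand-ins for the paper's interactive objects and interfaces.

```python
# Generic iNLP interaction loop: the LM acts, an external entity responds,
# and the feedback is folded back into the LM's context. Hypothetical sketch.

def lm_act(history: list[str]) -> str:
    """Hypothetical language-model policy over the interaction history."""
    raise NotImplementedError("plug in a real language model")

def interact(external, turns: int = 5) -> list[str]:
    history: list[str] = []
    for _ in range(turns):
        action = lm_act(history)        # act: emit text, query, or tool call
        feedback = external(action)     # observe: human/KB/tool/env response
        history += [action, feedback]   # incorporate feedback iteratively
    return history
```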
On the Generalization of Learned Structured Representations
Despite tremendous progress over the past decade, deep learning methods generally fall short of human-level systematic generalization. It has been argued that explicitly capturing the underlying structure of data should allow connectionist systems to generalize in a more predictable and systematic manner. Indeed, evidence in humans suggests that interpreting the world in terms of symbol-like compositional entities may be crucial for intelligent behavior and high-level reasoning. Another common limitation of deep learning systems is that they require large amounts of training data, which can be expensive to obtain. In representation learning, large datasets are leveraged to learn generic data representations that may be useful for efficient learning of arbitrary downstream tasks. This thesis is about structured representation learning. We study methods that learn, with little or no supervision, representations of unstructured data that capture its hidden structure. In the first part of the thesis, we focus on representations that disentangle the explanatory factors of variation of the data. We scale up disentangled representation learning to a novel robotic dataset, and perform a systematic large-scale study on the role of pretrained representations for out-of-distribution generalization in downstream robotic tasks. The second part of this thesis focuses on object-centric representations, which capture the compositional structure of the input in terms of symbol-like entities, such as objects in visual scenes. Object-centric learning methods learn to form meaningful entities from unstructured input, enabling symbolic information processing on a connectionist substrate. In this study, we train a selection of methods on several common datasets, and investigate their usefulness for downstream tasks and their ability to generalize out of distribution.
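One concrete object-centric method of the kind evaluated here is Slot Attention (Locatello et al., 2020). The sketch below keeps only its competitive-attention update, omitting the learned projections and GRU, so it is a simplification rather than a faithful implementation.

```python
# Simplified Slot Attention update: slots compete for input features via a
# softmax over slots, then are refreshed as attention-weighted means.

import numpy as np

def slot_attention_step(slots: np.ndarray, inputs: np.ndarray,
                        eps: float = 1e-8) -> np.ndarray:
    """slots: (K, d), inputs: (N, d). One round of slot competition."""
    logits = inputs @ slots.T / np.sqrt(slots.shape[1])        # (N, K)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                    # softmax over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + eps)   # normalize per slot
    return weights.T @ inputs                                  # weighted-mean update

# Toy usage: 4 slots binding to 50 feature vectors of dimension 16.
slots = slot_attention_step(np.random.randn(4, 16), np.random.randn(50, 16))
```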
Coordinate Independent Convolutional Networks -- Isometry and Gauge Equivariant Convolutions on Riemannian Manifolds
Weiler, Maurice, Forré, Patrick, Verlinde, Erik, Welling, Max
Motivated by the vast success of deep convolutional networks, there is a great interest in generalizing convolutions to non-Euclidean manifolds. A major complication in comparison to flat spaces is that it is unclear in which alignment a convolution kernel should be applied on a manifold. The underlying reason for this ambiguity is that general manifolds do not come with a canonical choice of reference frames (gauge). Kernels and features therefore have to be expressed relative to arbitrary coordinates. We argue that the particular choice of coordinatization should not affect a network's inference -- it should be coordinate independent. A simultaneous demand for coordinate independence and weight sharing is shown to result in a requirement on the network to be equivariant under local gauge transformations (changes of local reference frames). The ambiguity of reference frames depends thereby on the G-structure of the manifold, such that the necessary level of gauge equivariance is prescribed by the corresponding structure group G. Coordinate independent convolutions are proven to be equivariant w.r.t. those isometries that are symmetries of the G-structure. The resulting theory is formulated in a coordinate free fashion in terms of fiber bundles. To exemplify the design of coordinate independent convolutions, we implement a convolutional network on the Möbius strip. The generality of our differential geometric formulation of convolutional networks is demonstrated by an extensive literature review which explains a large number of Euclidean CNNs, spherical CNNs and CNNs on general surfaces as specific instances of coordinate independent convolutions.
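For reference, the kernel constraint at the heart of this framework can be written in the steerable-CNN form common in this line of work; the rendering below follows that convention and may differ from the paper's exact notation.

```latex
% Gauge-equivariance constraint on convolution kernels (steerable-CNN form):
% for gauge transformations g in the structure group G, with input/output
% field representations rho_in and rho_out, the kernel K must satisfy
K(g\,v) \;=\; \frac{1}{|\det g|}\,
\rho_{\mathrm{out}}(g)\, K(v)\, \rho_{\mathrm{in}}(g)^{-1},
\qquad \forall\, g \in G,\ v \in \mathbb{R}^{d}.
```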